In [57]:
!jt -t grade3
# !jt -r

Simple Linear Regression¶



Simple linear regression is a statistical method used to model the relationship between a single independent variable (predictor) and a continuous dependent variable (response). Here's a brief introduction to simple linear regression, covering its purpose, strengths, weaknesses, and the types of data it works best on:

Purpose:¶

The purpose of simple linear regression is to understand and quantify the linear relationship between two variables. It helps in predicting the value of the dependent variable based on the value of the independent variable. The model assumes a straight-line relationship between the variables and estimates the parameters (slope and intercept) that define this line.

Strengths:¶

  1. Interpretability: Simple linear regression results in a straightforward model represented by a linear equation (y = mx + b), making it easy to interpret and communicate.

  2. Prediction: It provides a simple method for making predictions. Once the model parameters are estimated, you can use the model to predict the dependent variable for new values of the independent variable.

  3. Visualization: The relationship between variables can be visualized easily with a scatter plot and the fitted regression line.

Weaknesses:¶

  1. Assumption of Linearity: Simple linear regression assumes a linear relationship between the variables, which may not be suitable for data with non-linear patterns.

  2. Sensitivity to Outliers: The model can be sensitive to outliers, and a single influential data point can significantly impact the estimated parameters.

  3. Assumption of Independence: The model assumes that observations are independent. If there is dependence between observations, the model's performance may be affected.

Data:¶

Simple linear regression works best when:

  • There is a linear relationship between the independent and dependent variables.
  • The variability in the dependent variable can be explained by changes in the independent variable.
  • Residuals (the differences between observed and predicted values) are approximately normally distributed.
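The last condition can be checked directly after fitting: the residuals should center on zero with no visible trend. A minimal sketch on synthetic data (using `np.polyfit` for the fit; the data here is illustrative, not this notebook's dataset):

```python
import numpy as np

# Fit a line to synthetic data and inspect the residuals
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)

m, b = np.polyfit(x, y, 1)
residuals = y - (m * x + b)

# For a well-specified linear model the residuals center on zero
print(f"mean residual = {residuals.mean():.3f}, std = {residuals.std():.3f}")
```

A histogram of `residuals` (or a Q-Q plot) is the usual visual check for approximate normality.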

Use Cases:¶

  • Sales and Revenue: Predicting sales based on advertising spending.
  • Temperature and Energy Consumption: Understanding the relationship between outdoor temperature and energy usage.
  • Exam Scores and Study Hours: Predicting exam scores based on the number of study hours.

Considerations:¶

  • It's essential to assess the assumptions of simple linear regression, including linearity, independence, and normality of residuals.
  • Simple linear regression is often a starting point for more complex regression analysis when dealing with multiple predictors.

In summary, simple linear regression is a valuable tool for modeling and understanding the linear relationship between two variables. Its simplicity makes it a good choice for situations where a linear model is appropriate and interpretable.
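The slope and intercept of the line y = mx + b have closed-form least-squares estimates, which can be sketched directly in NumPy (synthetic data; variable names are illustrative):

```python
import numpy as np

# Synthetic data with a known linear trend: y ~ 3x + 5 plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 * x + 5 + rng.normal(scale=0.5, size=x.size)

# Least-squares estimates: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b = y.mean() - m * x.mean()

print(f"slope = {m:.2f}, intercept = {b:.2f}")
```

The recovered slope and intercept should land close to the true values (3 and 5) used to generate the data.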

1. Download data from Kaggle¶


In [38]:
# install the opendatasets package
# !pip install opendatasets
import opendatasets as od

# download the dataset (a Kaggle dataset); you will be prompted for your Kaggle username and API key

od.download("https://www.kaggle.com/datasets/ramlalnaik/fuelconsumptionco2?select=FuelConsumptionCo2.csv")
Skipping, found downloaded files in "./fuelconsumptionco2" (use force=True to force download)

Alternatively, one could use...¶

In [ ]:
# an alternative approach in case that was too simple :)
import requests

def download(url, filename):
    response = requests.get(url)
    if response.status_code == 200:
        with open(filename, "wb") as f:
            f.write(response.content)
    else:
        print(f"Error: {response.status_code} - {response.reason}")

# Example usage (note: a Kaggle dataset page returns HTML, not the raw CSV,
# so this URL will not yield the actual data without the Kaggle API)
url = "https://www.kaggle.com/datasets/ramlalnaik/fuelconsumptionco2?select=FuelConsumptionCo2.csv"
filename = "FuelConsumptionCo2.txt"

download(url, filename)

2. Import Packages¶


In [63]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

3. Read and Summarize the Data¶


In [64]:
df = pd.read_csv("./fuelconsumptionco2/FuelConsumptionCo2.csv")
# take a look at the dataset
print(df.describe())
df.head()
       MODELYEAR   ENGINESIZE    CYLINDERS  FUELCONSUMPTION_CITY  \
count     1067.0  1067.000000  1067.000000           1067.000000   
mean      2014.0     3.346298     5.794752             13.296532   
std          0.0     1.415895     1.797447              4.101253   
min       2014.0     1.000000     3.000000              4.600000   
25%       2014.0     2.000000     4.000000             10.250000   
50%       2014.0     3.400000     6.000000             12.600000   
75%       2014.0     4.300000     8.000000             15.550000   
max       2014.0     8.400000    12.000000             30.200000   

       FUELCONSUMPTION_HWY  FUELCONSUMPTION_COMB  FUELCONSUMPTION_COMB_MPG  \
count          1067.000000           1067.000000               1067.000000   
mean              9.474602             11.580881                 26.441425   
std               2.794510              3.485595                  7.468702   
min               4.900000              4.700000                 11.000000   
25%               7.500000              9.000000                 21.000000   
50%               8.800000             10.900000                 26.000000   
75%              10.850000             13.350000                 31.000000   
max              20.500000             25.800000                 60.000000   

       CO2EMISSIONS  
count   1067.000000  
mean     256.228679  
std       63.372304  
min      108.000000  
25%      207.000000  
50%      251.000000  
75%      294.000000  
max      488.000000  
Out[64]:
MODELYEAR MAKE MODEL VEHICLECLASS ENGINESIZE CYLINDERS TRANSMISSION FUELTYPE FUELCONSUMPTION_CITY FUELCONSUMPTION_HWY FUELCONSUMPTION_COMB FUELCONSUMPTION_COMB_MPG CO2EMISSIONS
0 2014 ACURA ILX COMPACT 2.0 4 AS5 Z 9.9 6.7 8.5 33 196
1 2014 ACURA ILX COMPACT 2.4 4 M6 Z 11.2 7.7 9.6 29 221
2 2014 ACURA ILX HYBRID COMPACT 1.5 4 AV7 Z 6.0 5.8 5.9 48 136
3 2014 ACURA MDX 4WD SUV - SMALL 3.5 6 AS6 Z 12.7 9.1 11.1 25 255
4 2014 ACURA RDX AWD SUV - SMALL 3.5 6 AS6 Z 12.1 8.7 10.6 27 244
In [115]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook'

# Columns to plot
columns = ['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB', 'CO2EMISSIONS']

# Create subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=columns)

# Populate subplots with histograms
fig.add_trace(go.Histogram(x=df['ENGINESIZE']), row=1, col=1)
fig.add_trace(go.Histogram(x=df['CYLINDERS']), row=1, col=2)
fig.add_trace(go.Histogram(x=df['FUELCONSUMPTION_COMB']), row=2, col=1)
fig.add_trace(go.Histogram(x=df['CO2EMISSIONS']), row=2, col=2)

# Update layout
fig.update_layout(
    title='Histograms of Selected Columns',
    showlegend=False
)

# Show the plot
# fig.show()
In [111]:
import plotly.graph_objects as go
# Data
x_values = df['FUELCONSUMPTION_COMB']
y_values = df['CO2EMISSIONS']

# Create a scatter plot with styling
fig = go.Figure(data=go.Scatter(
    x=x_values,
    y=y_values,
    mode='markers',
    marker=dict(
        color='darkorange',
        size=12,
        line=dict(color='gray', width=1),
        symbol='circle',
        opacity=0.8
    )
))

# Update layout
fig.update_layout(
    title='Emissions and Fuel Consumption',
    xaxis_title='FUELCONSUMPTION_COMB',
    yaxis_title='Emissions',
    template='ggplot2'
)
In [112]:
import plotly.graph_objects as go
# Data
x_values = df['ENGINESIZE']
y_values = df['CO2EMISSIONS']

# Create a scatter plot with styling
fig = go.Figure(data=go.Scatter(
    x=x_values,
    y=y_values,
    mode='markers',
    marker=dict(
        color='darkorange',
        size=12,
        line=dict(color='gray', width=1),
        symbol='circle',
        opacity=0.8
    )
))

# Update layout
fig.update_layout(
    title='Emissions and Engine Size',
    xaxis_title='Engine Size',
    yaxis_title='Emissions',
    template='ggplot2'
)

4. Train and Test Set Procedure¶


In the context of simple linear regression (or regression modeling in general), the process of splitting data into training and test sets is crucial for evaluating the performance of the model. Here's a description of the train and test set procedure and why it is done:

  1. Data Splitting:

    • Start with a dataset that includes both input features (independent variable) and the corresponding target variable (dependent variable).
    • The dataset is split into two subsets: a training set and a test set.
  2. Training Set:

    • The training set is used to train (fit) the model. The model learns the patterns and relationships within the data by adjusting its parameters to minimize the difference between predicted and actual values.
  3. Test Set:

    • The test set is kept separate from the training process and is not used during the model fitting phase.
    • After the model is trained, it is evaluated on the test set to assess how well it generalizes to new, unseen data.

Why It Is Done:¶

  1. Model Evaluation:

    • Splitting the data allows you to evaluate your model on data it has never seen before. This provides a more realistic assessment of how well the model will perform on new, unseen data.
  2. Generalization:

    • The goal of a regression model is to generalize well to new, unseen data. By using a separate test set, you can assess the model's ability to generalize beyond the training data.
  3. Avoiding Overfitting:

    • Overfitting occurs when a model learns the training data too well, capturing noise and patterns that are specific to the training set but do not generalize. A test set helps identify if the model is overfitting.
  4. Parameter Tuning:

    • If you need to tune hyperparameters (e.g., regularization strength), you can use a validation set (a subset of the training set) for this purpose. The final evaluation is still performed on the test set.
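The train/validation/test idea above can be sketched with two calls to `train_test_split`: the first carves off the final test set, the second splits the remainder into training and validation data. The arrays and split proportions here are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve off 20% of the data as the final test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Then split the remainder into training (75%) and validation (25%)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Hyperparameters are tuned against the validation set; the test set is touched only once, for the final evaluation.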

Procedure in Python:¶

In Python, you can use libraries like scikit-learn to split your data into training and test sets. Here's an example:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Assuming X is your feature matrix and y is your target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model and fit it on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = model.predict(X_test)

# Perform further evaluation (e.g., calculate metrics, visualize results)

By splitting your data into training and test sets, you can build and evaluate regression models more effectively, ensuring that they perform well on new, unseen data.

When using the train_test_split function from scikit-learn, you don't need to manually create a boolean mask to split the data; the function randomly partitions it for you.

In this example:

  • train_test_split splits the data according to the specified test_size (the proportion of the dataset to include in the test split).
  • random_state ensures reproducibility: setting a specific seed produces the same split every time you run the code.

The function returns four arrays: X_train, X_test, y_train, and y_test, which you can use for training and evaluating your regression model.

In [51]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = np.asanyarray(df[['ENGINESIZE']])
y = np.asanyarray(df[['CO2EMISSIONS']])

# Split the feature matrix X and target variable y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model and fit it on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = model.predict(X_test)

print('Coefficients: ', model.coef_)
print('Intercept: ', model.intercept_)
Out[51]:
LinearRegression()
Coefficients:  [[38.99297872]]
Intercept:  [126.28970217]

5. Evaluate Model Accuracy¶


Several metrics are commonly used to evaluate a simple linear regression model:

  1. Coefficient of Determination (R-squared):

    • R-squared represents the proportion of the variance in the dependent variable (target) that is explained by the independent variable (feature). It ranges from 0 to 1, where 1 indicates a perfect fit.
    • Formula: $$R^2 = 1 - \frac{SSR}{SST},$$
      where SSR is the sum of squared residuals and SST is the total sum of squares.
  2. Mean Squared Error (MSE) or Root Mean Squared Error (RMSE):

    • MSE measures the average squared difference between the predicted and actual values. RMSE is the square root of MSE, providing an interpretable measure in the same units as the target variable.
    • Formulas: $$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$
      and
      $$RMSE = \sqrt{MSE},$$
      where $n$ is the number of observations.
  3. Mean Absolute Error (MAE):

    • MAE measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers compared to MSE.
    • Formula: $$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|,$$ where $n$ is the number of observations.
  4. Residual Analysis and Residual Plots:

    • Examining the residuals (the differences between predicted and actual values) through residual plots helps identify patterns, heteroscedasticity, and outliers.
    • Residuals should be randomly distributed around zero with no clear patterns.
  5. Adjusted R-squared:

    • Adjusted R-squared adjusts the R-squared value to penalize the inclusion of irrelevant predictors. It is useful when dealing with multiple predictors.
    • Formula: $$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{(n - k - 1)},$$ where $n$ is the number of observations and $k$ is the number of predictors.
  6. F-statistic:

    • The F-statistic tests the overall significance of the regression model. It assesses whether the model explains a significant amount of variance in the dependent variable.
    • A high F-statistic with a low p-value indicates a significant model.

When interpreting these metrics, it's important to consider the specific characteristics of your data and the goals of your analysis. Additionally, using a combination of metrics provides a more comprehensive evaluation of model performance.

In [47]:
from sklearn.metrics import r2_score

# X_test = np.asanyarray(X_test)
# y_test = np.asanyarray(y_test)
y_pred = model.predict(X_test)

print("Mean absolute error: %.2f" % np.mean(np.absolute(y_pred - y_test)))
print("Mean squared error (MSE): %.2f" % np.mean((y_pred - y_test) ** 2))
print("R2-score: %.2f" % r2_score(y_test, y_pred))
Mean absolute error: 24.10
Mean squared error (MSE): 985.94
R2-score: 0.76
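The remaining metrics from the list above, RMSE and adjusted R², follow directly from the same quantities. A minimal sketch on small synthetic vectors (not this notebook's predictions), with k = 1 predictor as in simple regression:

```python
import numpy as np

# Illustrative true and predicted values
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_hat = np.array([210.0, 240.0, 310.0, 340.0])
n, k = len(y_true), 1  # n observations, k predictors

# MSE and its square root, RMSE (same units as the target)
mse = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(mse)

# R^2 = 1 - SSR/SST, then the adjusted version penalizing extra predictors
r2 = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"RMSE = {rmse:.2f}, R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}")
```

With one predictor, R² and adjusted R² differ only slightly; the gap widens as more predictors are added without explanatory power.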
In [113]:
x_values = df['ENGINESIZE']
y_values = df['CO2EMISSIONS']

# Create a scatter plot with styling
scatter_trace = go.Scatter(
    x=x_values,
    y=y_values,
    mode='markers',
    marker=dict(
        color='darkorange',
        size=12,
        line=dict(color='gray', width=1),
        symbol='circle',
        opacity=0.8
    ),
    name='Scatter Plot'
)

# Create a regression line trace (note: np.polyfit refits on the full dataset,
# so this line may differ slightly from the train-set model above)
regression_line_trace = go.Scatter(
    x=x_values,
    y=np.polyval(np.polyfit(x_values, y_values, 1), x_values),
    mode='lines',
    line=dict(color='green', width=2, dash='dash'),
    name='Regression Line'
)

# Create the figure
fig = go.Figure(data=[scatter_trace, regression_line_trace])

# Update layout
fig.update_layout(
    title='Emissions and Engine Size',
    xaxis_title='Engine Size',
    yaxis_title='Emissions',
    template='ggplot2'
)
In [ ]: